ABSTRACT


Most pitch excited channel vocoders require the fundamental or laryngeal frequency
of the input speech to be present if the output speech is to be of high quality. In
order to determine if speech whose fundamental is absent can have its pitch accurately
restored so as to be used as an input to a vocoder, a computer simulation
was performed. The fundamental was restored by passing the speech through a
fullwave rectifier followed by a slope filter. The accuracy of the pitch restoration
of this method was compared with that of simply measuring the pitch of speech
whose fundamental was present by slope filtering alone. A third pitch detection
method, that of visually displaying the speech waveform and determining the pitch
by eye, was also used as a comparison. Pitch contours of the three methods indicate
that pitch restored by fullwave rectification and slope filtering has larger perturbations
than pitch as detected by slope filtering alone. Both methods produced pitch
contours having much larger perturbations than pitch determined visually.

Speech whose pitch was determined by the above three methods was used to excite
the spectrally flattened Lincoln Laboratory Vocoder. Listening tests of the vocoder
output indicate that pitch restored by fullwave rectification and slope filtering produced
rougher sounding speech than speech whose pitch was detected by slope filtering
alone, but both methods produced speech having considerably more audible roughness
than that produced by visually detected pitch. Finally, the sophisticated pitch detector
of the vocoder itself produced speech of quality comparable to that determined visually.


Accepted  for  the  Air  Force 

Franklin  C.  Hudson 

Chief,  Lincoln  Laboratory  Office 

VOCODED SPEECH IN THE ABSENCE OF THE LARYNGEAL FREQUENCY


I.  INTRODUCTION 

Recent interest in speech bandwidth compression devices has renewed the
search for accurate methods of extraction of the laryngeal frequency or pitch from
human speech. One of the most widely known speech bandwidth compression
devices is the vocoder, invented by Homer Dudley in the 1930's. To effect
a bandwidth compression of the speech, the vocoder uses the fact that speech, at
least in the steady state, has a frequency spectrum consisting of a fundamental and
harmonics of this fundamental (see Fig. 1). The successful operation of the vocoder
depends critically on the accurate detection of this fundamental, or pitch, of the
input speech. Early vocoder attempts to compress the bandwidth of speech produced
speech of unreliable quality, but recent attempts have shown that this reliability can
be improved. To a large degree, this improvement in vocoded speech can be traced
to the development of devices that detect pitch more accurately than those
previously employed.

Many pitch detection schemes require that the speech fundamental be present
in the input spectrum to the vocoder. However, sometimes this fundamental is
filtered out so that the input speech to the vocoder has no fundamental present. For
example, the speech fundamental generally is between 50 and 400 Hz, but telephone
lines have a lower cutoff of about 300 Hz. Hence, the speech at the telephone central
office, which is where the vocoder could most profitably be situated, may often not
contain a fundamental.

II.  THE  PURPOSE  OF  THE  THESIS 

The primary intent of this paper is to examine the possibility of reconstruction
of an accurate speech fundamental where the original has been removed as in
the above example. If this can be done, it will then be possible to detect the pitch
in speech with the fundamental absent by first restoring the fundamental and then
using present pitch detection techniques.

McKinney has hypothesized that steady state speech whose fundamental has
previously been removed can have that fundamental restored if

$$\sum_{n=2}^{M} n A_n \le A_1 \qquad (1)$$

where n = the harmonic number
$A_n$ = the amplitude of the nth harmonic
M = the highest harmonic present in the filtered speech

Fig. 1. Logarithmic Spectrum of "ee" Sound in See.


If this condition is not satisfied then McKinney states that accurate construction
of the fundamental cannot be assured. He states that a computer simulation has
validated this result for some special cases and that actual measurement of the
fundamental can be made by passing the speech through a non-linear device. The
pitch can then be obtained by applying conventional pitch extraction techniques to
the output. Mathematical analysis seems to indicate that a fullwave rectifier would
be best, although any even power non-linear device should perform well, assuming,
of course, that the filtered speech satisfies Eq. (1).
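
The idea that an even non-linearity can regenerate a missing fundamental can be illustrated numerically. The following is a minimal sketch, not the program used in this report: it assumes a synthetic steady state signal built from harmonics 3 through 7 of a 125 Hz fundamental (values chosen here only so that every component falls on an FFT bin), and the helper names `harmonic_signal` and `amplitude_at` are hypothetical.

```python
import numpy as np

FS = 10_000   # sampling rate in Hz (the report samples at 10 kHz)
F0 = 125.0    # assumed fundamental; harmonics then fall on exact FFT bins
N = 4000      # 0.4 s of signal, so the bin spacing is FS / N = 2.5 Hz


def harmonic_signal(harmonics, amplitudes):
    """Sum of cosines at the given harmonic numbers of F0."""
    t = np.arange(N) / FS
    return sum(a * np.cos(2 * np.pi * n * F0 * t)
               for n, a in zip(harmonics, amplitudes))


def amplitude_at(signal, freq_hz):
    """Amplitude of the spectral line at freq_hz (must lie on an FFT bin)."""
    spectrum = np.abs(np.fft.rfft(signal)) / (N / 2)
    return spectrum[int(round(freq_hz * N / FS))]


# Harmonics 3..7 of 125 Hz lie between 375 and 875 Hz, so this signal
# resembles 300-900 Hz filtered speech with no energy at the fundamental.
filtered = harmonic_signal([3, 4, 5, 6, 7], [1.0, 0.8, 0.6, 0.4, 0.2])
rectified = np.abs(filtered)          # fullwave rectifier

before = amplitude_at(filtered, F0)   # essentially zero
after = amplitude_at(rectified, F0)   # a clear line reappears at F0

print(f"amplitude at {F0:.0f} Hz before rectification: {before:.2e}")
print(f"amplitude at {F0:.0f} Hz after rectification:  {after:.3f}")
```

The rectifier output contains difference frequencies between adjacent harmonics, which is where the component at the fundamental comes from. The sketch covers only the steady state case; it says nothing about speech whose pitch is changing.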

Speech is not completely of a steady state nature, and McKinney's analysis
does not cover the situations where the pitch is varying in time. To determine what
happens when speech has its fundamental removed and then restored by a non-linear
device, a computer simulation was performed; its results are described in this paper.
These results are compared with the pitch detected by several methods wherein the
speech fundamental was present. The pitch data of the various pitch detection schemes,
along with the original speech, were used to excite a vocoder, and subjective listening
tests were performed on the outputs to determine what effect reconstruction of the
fundamental has on speech quality.

III.  THE  EXPERIMENT 

The first part of the experiment is a computer simulation of three pitch detection
schemes to determine the effects on the pitch of speech whose fundamental has
been filtered out but has later been restored. The computer used in this simulation
was MIT's Electrical Engineering Department's PDP-1. For a brief description of
some of the features of this facility see Appendix 1.

A.  Speech  Input  to  Computer 

In order to detect pitch by computer it was first necessary to convert the
analog speech signal into numerical data so that it could be stored and manipulated
by the computer. This was performed using an 8-bit analog-to-digital (A-D) converter
with timing provided by the computer program and the PDP-1. For a more
complete description of this operation see Appendix 2. The A-D converter
sampled the speech at a 10 kHz rate and computer memory space was sufficient to
store 1-1/2 seconds of sampled speech at this rate.

Fig. 2. Frequency Response of the Ideal Slope Filter
With 12 db/Octave Falloff Above 80 Hz.

Fig. 3. 0-3000 Hz Speech Signal (150 ms/Line) of "We Are Due About Eight", Speaker 1.

Fig. 4. Pitch Detection in the Absence of the Fundamental Frequency.

B. Description of Three Pitch Detection Techniques

1.  Pitch  Detection  in  the  Absence  of  the  Fundamental 

To detect pitch when the fundamental was not present, the approach illustrated
in Fig. 4 was used. First, to assure that no fundamental was present, the speech
data was filtered by a 300-900 Hz 7-pole Lerner filter. For a discussion of the
digital filters used in this experiment see Appendix 3. The resulting 300-900 Hz
waveform was then fullwave rectified to restore the fundamental and then sent
through a slope filter having a slope of 12 db/octave beginning at about 80 Hz as
illustrated in Fig. 2. Since the fundamental generally was between 50 and 400 Hz,
this slope filter would enhance the fundamental with respect to its harmonics, provided
the pitch was above 80 Hz. As a result, the number of local maxima per
second of the output of the slope filter was usually proportional to the pitch of the
speech wave.*

The output of this slope filter was then examined for local maxima. The time
differences between these maxima represented the local period of the speech wave.
The original speech wave and the pitch periods were then displayed on the scope
face of the computer in a manner similar to that illustrated in Fig. 3. Any data
points that represented gross errors such as frequency doubling were eliminated. This
speech data was then recorded on audio tape by the method described in Section III C.
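
The processing chain of Fig. 4 can be sketched in a few lines. This is an illustration under stated assumptions rather than the report's implementation: the 7-pole Lerner filter is replaced by an ideal FFT brick-wall band-pass, the slope filter is modelled as a zero-phase gain that is flat up to 80 Hz and falls as 1/f² (12 db/octave) above it, and the function names are hypothetical. The test tone is synthetic, not speech.

```python
import numpy as np

FS = 10_000  # sampling rate in Hz, as in the report


def bandpass_fft(x, lo_hz, hi_hz):
    """Ideal band-pass (stand-in for the 300-900 Hz Lerner filter)."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / FS)
    spec[(freqs < lo_hz) | (freqs > hi_hz)] = 0.0
    return np.fft.irfft(spec, n=len(x))


def slope_filter(x, corner_hz=80.0):
    """Zero-phase gain: flat to corner_hz, then falling 12 db/octave."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / FS)
    gain = np.ones_like(freqs)
    above = freqs > corner_hz
    gain[above] = (corner_hz / freqs[above]) ** 2
    return np.fft.irfft(spec * gain, n=len(x))


def local_maxima(x):
    """Indices of interior samples larger than both neighbours."""
    i = np.arange(1, len(x) - 1)
    return i[(x[i] > x[i - 1]) & (x[i] >= x[i + 1])]


def pitch_from_missing_fundamental(speech):
    """Band-pass, fullwave rectify, slope filter, then time the maxima."""
    band = bandpass_fft(speech, 300.0, 900.0)
    restored = slope_filter(np.abs(band))
    periods_s = np.diff(local_maxima(restored)) / FS
    return 1.0 / periods_s  # one pitch estimate (Hz) per period


# Synthetic steady tone: a 125 Hz fundamental plus its 4th and 5th harmonics.
t = np.arange(4000) / FS
tone = sum(np.cos(2 * np.pi * f * t) for f in (125.0, 500.0, 625.0))
pitch_hz = pitch_from_missing_fundamental(tone)
print("median pitch estimate (Hz):", np.median(pitch_hz))
```

Omitting the `bandpass_fft` and `np.abs` steps gives the second scheme described below, in which the slope filter is applied directly to speech that still contains its fundamental. The removal of gross errors by inspection of the display is not modelled here.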

2.  Pitch  Detection  With  a  Slope  Filter  in  the  Presence  of 
a  Fundamental 

As a comparison against the detection scheme described above, a second
pitch detection technique was used in which the speech fundamental was present
(see Fig. 5). In this scheme the 300-900 Hz filter and fullwave rectifier were
omitted. The speech signal was sent through the same slope filter used above,
followed by the local maximum detector. Again the original speech waveforms plus
the pitch periods were visually displayed and gross pitch errors were removed. The
speech and pitch indications were then recorded on tape as described in Section III C.

*However, this was not always the case. For example, if the speech signal had
a first harmonic that was 12 db above the fundamental, then the output of the slope
filter would have the fundamental and first harmonic of equal energies, and the
number of local maxima per second of this output would then not be proportional to
the fundamental alone.
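
The arithmetic behind this footnote can be written out. As an assumption for illustration, take the ideal slope filter of Fig. 2 to have a magnitude response proportional to $1/f^2$ above its 80 Hz corner, and let $f_0$ be a fundamental above that corner. The component one octave higher is then attenuated relative to the fundamental by

```latex
20\log_{10}\frac{|H(2f_0)|}{|H(f_0)|}
  = 20\log_{10}\frac{f_0^{2}}{(2f_0)^{2}}
  = 20\log_{10}\frac{1}{4}
  \approx -12\ \text{db}, \qquad f_0 > 80\ \text{Hz}.
```

A component at $2f_0$ that enters the filter 12 db above the fundamental therefore leaves it at the same level as the fundamental, which is the equal-energy case the footnote describes.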

Fig. 5. Pitch Detection by Slope Filtering With the Fundamental Present.

Fig. 6. Pitch Detection by Visual Means.

Fig. 7. System for Recording Speech and Pitch Pulses on a Two Track Audio Tape.

3. Pitch Detection by Eye with the Fundamental Present


As a further check against both of the above methods the speech was visually
displayed and the periods recorded in the computer memory. To facilitate this,
the original speech was first filtered by using the 300-900 Hz Lerner filter
described earlier (see Fig. 6). Next, distances between local maxima were
recorded with the aid of the local maximum locator used in the other two pitch
detection schemes. These maximum points and the filtered wave were displayed
and all indications of local maxima that did not correspond to periods of this wave
were discarded. These apparently correct pitch indications and the speech were
then recorded on audio tape by the method described below.

C.  Read  Out  of  Pitch  Data  and  the  Lincoln  Laboratory  Vocoder 

The final result of the computer simulations described above was a two-track
audio tape recording. One track contained the original analog speech while the second
track contained analog pitch pulses that were synchronized with the analog speech
(see Fig. 7 and Appendix 2). This two-track recording was then used as input to the
Lincoln Laboratory vocoder, which contains spectral flattening in its synthesizer
stage. The analog speech was fed into the analyzer stage of the vocoder in the conventional
manner while the analog pitch pulses were fed into the synthesizer stage of
the vocoder as shown in Fig. 8. Thus, by disconnecting the pitch detector of the
vocoder, the pitch pulses generated by the computer simulation could be substituted
for those ordinarily generated by the vocoder itself. In practice the filters of the
analyzer stage delayed the speech with respect to the pitch by approximately 10 ms.
It was felt that this was of only minor importance in the experiment and consequently
no attempt was made to correct this by also delaying the pitch pulses. Audio tape
recordings were made of the vocoder outputs for each of the three pitch detection
schemes described above. For reference, recordings were also made of the vocoder
output with excitation derived from its own pitch extractor.

IV.  RESULTS 

Three  sentences,  each  of  about  one  second’s  duration,  were  used  to  compare 
the  various  pitch  detection  schemes.  The  sentences  used  were: 

1.  We  are  due  about  eight. 

2.  Whom  am  I  to  meet? 

3.  I  love  you! 

Fig. 8. Construction of Synthesized Speech Employing Computer Detected Pitch
and the Lincoln Laboratory Vocoder.

Each sentence was spoken once by three different speakers and recorded on audio
tape. It was this tape that was used as input to the computer and to one track of the
two-track tape recording used to excite the Lincoln Laboratory vocoder.

A new audio tape recording was made of the vocoder output and subjective
tests were performed. This audio tape recording was made up of groups of two
closely spaced sentences. The two sentences in each group were identical (i.e., the same
speaker and same words) except for the pitch detection scheme used to excite the
vocoder. The pitch detection schemes were

A.  Construction  of  the  fundamental  by  fullwave  rectification  and  slope 
filtering  (fundamental  originally  absent). 

B.  Slope  filtering  with  the  fundamental  present. 

C.  Eye  detection  of  pitch. 

D.  Normal  pitch  detection  by  the  vocoder's  pitch  detector. 

Ten  (10)  subjects  were  chosen  to  listen  to  the  recorded  vocoder  outputs.  These  ten 
listeners  included  people  who  have  very  little  knowledge  of  speech  mechanisms  and 
those  who  can  discern  subtle  differences  in  speech  quality.  The  listeners  were  asked 
which  sentence  in  each  group  of  the  two  sentences  they  thought  sounded  less  rough 
in  quality  or,  if  both  were  equally  rough,  which  they  preferred.  They  could  also 
indicate  that  they  had  no  preference  between  the  two  sentences.  This  comparison 
was  performed  on  sixty-eight  (68)  two  sentence  combinations. 

The  following  comparisons  of  pitch  detection  schemes  were  made  for  each  of 
three  sentences  and  each  of  three  speakers. 

Comparison  of  Pitch  Detection  Techniques 

Comparison 1

(A) fundamental absent, fullwave rectification and slope filtering compared to
(B) slope filter, fundamental present

Comparison 2

(A) fundamental absent, fullwave rectification and slope filtering compared to
(D) vocoder detected pitch, fundamental present

Comparison 3

(B) slope filter, fundamental present compared to
(D) vocoder detected pitch, fundamental present

Comparison 4

(C) eye detected pitch compared to
(D) vocoder detected pitch, fundamental present

For each comparison of the form type X compared with type Y, a second comparison
of the form type Y compared with type X was made at a later point in the
test. Presenting each pair in both orders tended to cancel any effect of one pitch
detection scheme appearing first in a comparison and the other second.

The results appear below for ten subjects. One of the two-track tape recordings
(sentence 3, speaker 1, scheme B) was accidentally destroyed, so comparisons
using it could not be made.
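
The count of sixty-eight two-sentence combinations is consistent with this design, as the following sketch shows. It rests on an assumption that the text does not state outright: that the destroyed recording was the scheme B version of sentence 3, speaker 1, so that every pairing involving it was dropped in both presentation orders. The variable names are hypothetical.

```python
from itertools import product

# The four comparisons listed above, as (first scheme, second scheme).
comparisons = [("A", "B"), ("A", "D"), ("B", "D"), ("C", "D")]
sentences = (1, 2, 3)
speakers = (1, 2, 3)

# Assumed identity of the destroyed recording: (sentence, speaker, scheme).
destroyed = (3, 1, "B")

trials = []
for (x, y), sentence, speaker in product(comparisons, sentences, speakers):
    if (sentence, speaker, x) == destroyed or (sentence, speaker, y) == destroyed:
        continue  # this pairing could not be presented
    trials.append((x, y, sentence, speaker))  # X heard first
    trials.append((y, x, sentence, speaker))  # reverse order, later in the test

n_listeners = 10
print("two-sentence combinations:", len(trials))
print("total judgments:", len(trials) * n_listeners)
```

Under that assumption the enumeration gives 68 combinations and, with ten listeners, 680 judgments, which equals the sum of all the tallies reported below.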

Comparison of Four Pitch Detection Techniques

Schemes      # times first        # times second       # times neither
compared     preferred            preferred            preferred

A, B         A to B:   44         B to A:   69         47
A, D         A to D:   28         D to A:  108         44
B, D         B to D:   20         D to B:  101         39
C, D         C to D:   42         D to C:   81         57

(See text for meanings of schemes A, B, C, D).

(# times preferred B) / (# times preferred A) = 69/44 = 1.57
(# times preferred D) / (# times preferred A) = 108/28 = 3.86
(# times preferred D) / (# times preferred B) = 101/20 = 5.05
(# times preferred D) / (# times preferred C) = 81/42 = 1.93


All  tests  involving  the  comparison  of  the  pitch  detector  of  the  vocoder  with  another 
technique  showed  that  the  sophisticated  pitch  detector  of  the  vocoder  produced  speech 
of  quality  superior  to  any  other  pitch  detection  scheme  used  in  the  experiment.  It 
also  showed  that  eye  detected  pitch  was  better  than  either  of  the  other  two  computer 
generated  pitch  schemes  by  a  factor  of  almost  2  to  1. 
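
The preference ratios can be recomputed directly from the tallies in the table. The snippet below is only a bookkeeping aid: the numbers are copied from the table, while the dictionary layout and the name `preference_ratio` are illustrative choices.

```python
# (votes for first scheme, votes for second scheme, no preference)
tallies = {
    ("A", "B"): (44, 69, 47),
    ("A", "D"): (28, 108, 44),
    ("B", "D"): (20, 101, 39),
    ("C", "D"): (42, 81, 57),
}


def preference_ratio(first_votes, second_votes):
    """How many times more often the second scheme was preferred."""
    return second_votes / first_votes


ratios = {pair: round(preference_ratio(a, b), 2)
          for pair, (a, b, _) in tallies.items()}
totals = {pair: sum(votes) for pair, votes in tallies.items()}

for (first, second), ratio in ratios.items():
    print(f"{second} preferred over {first} by {ratio} to 1 "
          f"({totals[(first, second)]} judgments)")
```

Each comparison totals 160 or 180 judgments, that is, 16 or 18 presented pairs heard by ten listeners.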

The comparison also showed that the listeners preferred vocoded speech that
resulted from speech whose pitch was present and detected by slope filtering
techniques to speech whose pitch was restored by fullwave rectification and then
slope filtered and detected. Thus, it appears that the pitch resulting from artificial
construction of the fundamental does not produce speech of quality comparable to
that produced by pitch detected from speech whose fundamental has not been previously
destroyed. Hence, an attempt to reconstruct an accurate fundamental by
fullwave rectification does not appear to have been successful.

The listening comparisons also indicated that the slope filter technique caused
a considerable amount of roughness in speech quality. This was apparent not only
in the listening comparisons made by the ten subjects but on a great many other
speech samples as well. It is felt that prefiltering the speech makes the peaks corresponding
to the periods of the resulting waveform too broad for the pitch to be
determined accurately. The actual computer generated pitch contours for each of the three
sentences spoken by the three speakers used in the experiment are plotted in
Fig. 9 through Fig. 17. A comparison of pitch data detected by a slope filter to that
detected by eye showed that the former method tended to generate pitch contours
having large pitch perturbations while the latter method produced contours having
smaller pitch perturbations. Since the only actual difference between the two methods
was a prefiltering of the speech by a slope filter, it must be concluded that the slope
filter caused the roughness in pitch.

The same graphs illustrated that reconstruction of the fundamental by fullwave
rectification followed by slope filtering also produced large perturbations in pitch.
A comparison of the pitch contours indicated that those contours resulting from fullwave
rectification and slope filtering had larger perturbations than those resulting
from slope filtering alone. Thus, it seemed that artificial regeneration of the speech
fundamental had resulted in a fundamental that had some unwanted jitter associated
with it. This result seems to support the results of R. R. Reisz.

Reisz states that the first harmonic is not always the vibrating frequency of
the vocal cords. If the frequency of vibration of the vocal cords is maintained constant
and the physical shape of the resonant cavities of the vocal tract is also maintained
constant (as in a sustained vowel sound), then the frequency of the nth harmonic at
the mouth opening is the same as that of the vocal cord wave. However, if the
physical shape of the resonant cavities of the vocal tract is changing in time, such
as during a sequence of speech sounds, then the resulting changes in the phase shift
along the vocal tract give rise to an apparent frequency of the harmonic component
which is somewhat different from its frequency in the steady state. Similarly,
harmonic frequencies are not integral multiples during pitch variation. Therefore,
any attempt to construct the fundamental by a detection arrangement that uses just
the differences in frequency of harmonic components will result in a fundamental
that differs from the true fundamental. This will be true because the harmonics are
not strictly multiples of the fundamental, since the resonant cavities of the vocal
tract are physically changing. Consequently, correct fundamentals can be determined
from harmonics in steady state speech but not during transition periods. This reasoning
would also account for the fact that the pitch contours of the fullwave rectification
detection scheme vary more rapidly than those generated by
slope filtering alone. Therefore, McKinney's analysis that predicts the conditions
for which the fundamental can be detected, although correct for steady state speech,
does not apply during a sequence of speech sounds. His analysis does not include
the fact that in moments of transition the harmonics are not exact multiples of the
fundamental, but are also a function of the physically changing vocal tract and
vocal cords.

V.  CONCLUSION 

A computer simulation and vocoder tests employing a vocoder with spectral
flattening have shown that if speech with a missing fundamental has the fundamental
restored by fullwave rectification then this speech will produce vocoded speech
having an audible roughness. Surprisingly, if the original speech to the vocoder
contains the fundamental and pitch detection is performed by slope filtering techniques,
then the vocoder outputs produce speech containing a considerable amount of audible
roughness. Pitch as determined by eye, however, produced more natural sounding
speech than either of the above methods of pitch extraction. Pitch contours indicate
that pitch as determined by eye has fewer pitch perturbations than pitch that
has either been restored by fullwave rectification or has been detected by slope
filtering. These contours indicate that there is a correlation between the audible
roughness of vocoded speech and the visible roughness of the pitch contours.

Finally, the sophisticated pitch extractor of the Lincoln Laboratory vocoder
operating on speech containing a fundamental produced a naturalness in speech
comparable to that produced by eye detected pitch. Both the Lincoln Laboratory
pitch extractor and pitch extraction by eye produced speech containing much less
audible roughness than either slope filtering with the fundamental present or
artificial reconstruction of the fundamental by fullwave rectification.


Fig. 9. Pitch Contours of Sentence "We Are Due About Eight", Speaker 1.
Fig. 10. Pitch Contours of Sentence "We Are Due About Eight", Speaker 2.
Fig. 11. Pitch Contours of Sentence "We Are Due About Eight", Speaker 3.
Fig. 12. Pitch Contours of Sentence "Whom Am I To Meet", Speaker 1.
Fig. 13. Pitch Contours of Sentence "Whom Am I To Meet", Speaker 2.
Fig. 14. Pitch Contours of Sentence "Whom Am I To Meet", Speaker 3.
Fig. 15. Pitch Contours of Sentence "I Love You!", Speaker 1.
Fig. 16. Pitch Contours of Sentence "I Love You!", Speaker 2.
Fig. 17. Pitch Contours of Sentence "I Love You!", Speaker 3.

APPENDIX 1

Description  of  Computer  Facility 

The computer simulations of this experiment were performed on MIT's
Electrical Engineering PDP-1. This 18-bit machine is programmed for time sharing
and has 2 registers of accessible memory with another 2 registers for the
time sharing routines. Because of this time sharing facility it was relatively easy
to get time on the PDP-1 for debugging the program. Since the actual simulation
required some real-time operations, the final simulation was performed in time
sharing mode with all other users turned off. In this manner it was possible to use
several of the time-sharing commands and yet have the full attention of the machine
at all times.

Although it was necessary to use machine language mnemonics in programming
the computer, many of the machine's other features made this machine nearly ideal.
In addition to the core memory an additional 64K is supplied by a drum. Speech
could be read into the computer using the Digital Equipment Corporation's 8-bit A-D
converter and analog voltages could be obtained using its D-A converter (see
Appendix 3). A cathode-ray oscilloscope provided easy visual observation of the
data during many portions of the simulation and a light pen enabled further manipulation
of this data.

The instruction list of the PDP-1 provided easy manipulations on the 18-bit
memory words. The cycle time was 5 µs, but most instructions were at least two
cycles long. The multiply instruction generally took about 30 µs. Since this instruction
occurred often in the simulation, it accounted for the program taking approximately
100 times real time.

On-line communication with the machine could be performed in a variety of
ways. This list included the typewriter console itself, a set of 18 toggle switches,
6 sense switches, a set of potentiometers controlling voltages to an A-D converter,
three other A-D inputs, the light pen and finally punched paper tape. All these forms
of inputs were used in the simulation.

At  this  time  the  PDP-1  is  undergoing  several  modifications  to  add  to  this  list 
of  inputs  and  make  it  even  more  useful  in  speech  work. 

Fig. 18. Analog-Digital Conversion of Speech.

Fig. 19. Frequency Response of Analog Cauer Filter.

APPENDIX 2


Speech  Input  and  Output  on  Computer 

Before  the  simulation  could  be  performed,  the  analog  speech  signal  on  the 
magnetic  recording  tape  input  had  to  be  converted  into  digital  data  by  an  A-D 
converter.  Only  after  this  operation  could  the  speech  in  the  form  of  numerical 
data  be  stored  in  the  memory  of  the  PDP-1.  Since  the  PDP-1  had  an  A-D  converter 
associated  with  it,  the  entire  digital  conversion  and  data  storage  could  be  done  in 
one  program. 

To understand this operation it is first necessary to know a little bit about the
A-D converter. On command from the PDP-1 this converter will sample an analog
signal between 0 and -10 volts and convert this sample into an unsigned 8-bit
quantity. The computer will halt during this conversion, which takes 30 µs. In order
to accomplish the A-D conversion, two things must then happen. First the input to
the A-D converter must be between 0 and -10 volts and second the computer must
be made to command the A-D to sample at the appropriate rate. This first requirement
is easily met by the arrangement of Fig. 18. If we restrict the analog speech output of the tape
recorder to be ±5 volts, then, by biasing this with a -5 volt source in series with
the tape recorder, the input to the A-D converter will be between 0 and -10 volts.
To insure that the sampling rate of 10 kc/sec used in this simulation was at least
twice as high as the highest frequency present in the input, a Cauer filter with frequency
characteristics illustrated in Fig. 19 was cascaded with the tape recorder.
This filter had to pass d.c. to insure the presence of the proper levels to the A-D
converter.

The fulfillment of the second requirement is understood by reference to the
flow chart of Fig. 20. The program is designed to sample 1-1/2 seconds of
speech (15,000 samples at a 10 kc/sec rate) and store them by packing 2 to a
word in the computer memory. The drum could not be used to store speech during
read-in because the drum write instruction took too long to execute. To ensure that
the desired 1-1/2 seconds of speech would be read in, a threshold detection scheme
was used. If in the beginning the sample read in was not above a certain threshold,
T, the sample would not be stored and a new sample would be taken. This would
continue until a sample exceeded the threshold, and then the remaining samples would
be read in and stored in the machine. In this way the silence preceding the beginning of
the utterance would not be stored in the computer. To achieve the correct sampling
rate, it was necessary to make use of the internal computer timing. This was done
by designing the read-in program so that the total execution time of the instructions
in one pass through the sampling loop was precisely 100 µs.
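
The threshold-gated read-in described above can be sketched as follows. This is only an illustrative sketch, not the PDP-1 program: the function name `read_in`, the choice to store the sample that first exceeds the threshold, and the packing layout (first sample of each pair in the upper 8 bits of the word) are assumptions. The 15,000-sample count and the two-samples-per-word packing come from the text, and the real-time 100 µs loop timing is not modeled.

```python
from typing import Iterable, List, Optional

SAMPLES_WANTED = 15_000  # 1-1/2 seconds at 10 kc/sec, as stated in the text


def read_in(samples: Iterable[int], threshold: int,
            n_wanted: int = SAMPLES_WANTED) -> List[int]:
    """Skip leading samples until one exceeds `threshold`, then pack that
    sample and the ones that follow two per word (assumed layout: first
    sample in the upper 8 bits, second in the lower 8 bits)."""
    words: List[int] = []
    pending: Optional[int] = None  # first sample of a pair awaiting its partner
    started = False
    stored = 0
    for s in samples:
        if not started:
            if s <= threshold:
                continue           # still in the leading silence: discard
            started = True         # assumption: the triggering sample is kept
        if pending is None:
            pending = s & 0xFF
        else:
            words.append((pending << 8) | (s & 0xFF))
            pending = None
        stored += 1
        if stored == n_wanted:
            break
    if pending is not None:        # odd number of samples: pad the last word
        words.append(pending << 8)
    return words
```

Two 8-bit samples occupy 16 bits, so each pair fits in one 18-bit PDP-1 word with two bits to spare.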
Fig. 20. Flow Chart of Speech Read-In Program.


At the end of the sampling an additional program was necessary to convert
the stored data into a form appropriate for input to the digital filters. Since the
A-D converter presented the computer with a positive number between 0 and
377₈, it was necessary to subtract 200₈ from this data before use so that the data
ranged between +177₈ and -177₈, corresponding to the tape recorder output of
±5 volts. This conversion was performed outside the sampling loop because of
timing problems within the loop.
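
A minimal sketch of this conversion step is given below, assuming a hypothetical packing of two 8-bit samples per word with the first sample in the upper 8 bits; the function name `unpack_and_center` is likewise illustrative. Only the subtraction of octal 200 is taken from the text. Note that a raw reading of exactly 0 maps to -200₈, one step beyond the ±177₈ range quoted above.

```python
OFFSET = 0o200  # octal 200 = 128, the mid-scale value subtracted from each sample


def unpack_and_center(words):
    """Split each packed word into its two 8-bit samples (assumed layout:
    upper byte first) and remove the mid-scale offset so that the values
    are centered on zero."""
    centered = []
    for w in words:
        for raw in ((w >> 8) & 0xFF, w & 0xFF):
            centered.append(raw - OFFSET)
    return centered
```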

As a check on the speech data a second program using the D-A converter was
written to convert this digital data into analog voltages at a 10,000 cycle/sec rate.
This analog voltage was then passed through the Cauer filter discussed above,
amplified and fed into a loudspeaker. Thus, it was possible to listen to the speech
data stored in the machine. A further check could be made by visually displaying
the speech data on the 'scope face.

Read  Out 

The final output from the computer simulation was a two-track tape recording.
On one track was the analog speech, and on the other the pitch pulses representing
the computed periods. This recording was made in the steps illustrated in Fig. 21.
The same speech which was originally read into the PDP-1 for determination of pitch
was again fed into the A-D converter using the technique described earlier. This
same signal was also recorded on the first track of the two-track tape. Again the
analog speech was sampled by the PDP-1 and sent through the same threshold detection
scheme used before to determine the beginning of the utterance (see flow chart of
Fig. 22). This time, however, when the threshold was exceeded the computer, via the
D-A converter, began to generate 5 volt, 50 µs wide pulses. These pulses, which
are made to occur once every computed pitch period of the wave, are recorded on the
second track.

Note that the two tracks are recorded by two separate means. They are brought into
synchronization by having the computer detect the same beginning of the utterance
(assuming the threshold is the same as before and is high enough) as in the data it
previously used to compute the pitch. The computer knows in advance which samples
have a pitch indication associated with them and, by asking every 100 µs (the same
rate as the original read-in) whether a pitch pulse should be generated, can output
the pulses at the proper time with respect to the recording of the first track.
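
The read-out timing can be illustrated with a short sketch. It assumes the computed pitch periods are available as counts of 100 µs samples measured from the threshold crossing; the function names and this representation are hypothetical, and the actual D-A pulse generation is reduced to a list of pulse start times.

```python
from itertools import accumulate

SAMPLE_PERIOD_US = 100  # one decision every 100 µs, the original read-in rate


def marks_from_periods(periods, first=0):
    """Turn successive pitch periods (in samples) into the sample indices
    at which a pitch pulse is due, starting at index `first`."""
    return [first] + [first + total for total in accumulate(periods)]


def pulse_times_us(pitch_marks, n_samples):
    """Step through the utterance one sample at a time, as the read-out loop
    does, and return the start time in microseconds of every pitch pulse."""
    marks = set(pitch_marks)
    times = []
    for n in range(n_samples):  # n counts 100 µs ticks after the threshold crossing
        if n in marks:
            times.append(n * SAMPLE_PERIOD_US)
    return times
```

For example, periods of 80 and 85 samples give pulses at 0, 8000 and 16500 µs after the detected beginning of the utterance.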
Fig. 21. System Used in Recording Speech
and Pitch Pulses in Synchronization.

Fig. 22. Flow Chart of Pitch Read-Out Program


Fig.  23.  Digital  System  Corresponding  to  a  Continuous 
Time  System  Having  an  Impulse  Response 
APPENDIX 3

Digital  Filter  Design  Techniques 

This appendix discusses the digital filters used in this experiment. All the
filters used have direct analog counterparts. In fact, these counterparts gave
rise to the digital filters by means of the methods of z-transforms (ref. 14). By using
the results of z-transforms it was possible to go from the continuous time system
having impulse response

h(t) = A e^{-at}

to the difference equation

y(nT) = A x(nT) + e^{-aT} y(nT - T)

where T is the sampling interval. The impulse response of this difference equation,
A e^{-anT}, consists of the samples of the impulse response of the continuous time
system; equivalently, its transfer function is A / (1 - e^{-aT} z^{-1}). For linear
continuous filters wherein the impulse response can be expressed as a sum of
exponentials, the desired digital system can be obtained in the same way. For example, if

h(t) = A_1 e^{-a_1 t} + A_2 e^{-a_2 t}

then the digital system could be described by the equations

y_1(nT) = x(nT) + e^{-a_1 T} y_1(nT - T)
y_2(nT) = x(nT) + e^{-a_2 T} y_2(nT - T)
y(nT) = A_1 y_1(nT) + A_2 y_2(nT)

which is shown pictorially in Fig. 23.
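
As a numerical check on this construction, the sketch below implements a first-order recursion and the two-branch parallel system, then compares the digital impulse response with samples of the continuous one. The function names and the numerical values of A_1, a_1, A_2, a_2 are illustrative assumptions, not the constants used in the experiment. The feedback coefficient enters as +e^{-aT}, the form for which the impulse response equals the samples A e^{-anT}.

```python
import math


def one_pole(x, a, T):
    """First-order recursion y(nT) = x(nT) + exp(-a*T) * y(nT - T),
    starting from a zero initial state."""
    k = math.exp(-a * T)
    y, prev = [], 0.0
    for xn in x:
        prev = xn + k * prev
        y.append(prev)
    return y


def parallel_two_pole(x, A1, a1, A2, a2, T):
    """Weighted sum of two first-order branches driven by the same input."""
    y1 = one_pole(x, a1, T)
    y2 = one_pole(x, a2, T)
    return [A1 * u + A2 * v for u, v in zip(y1, y2)]


if __name__ == "__main__":
    T = 1e-4                               # 100 µs sampling interval (10 kc/sec)
    A1, a1, A2, a2 = 2.0, 1000.0, -1.0, 3000.0  # illustrative values only
    impulse = [1.0] + [0.0] * 9
    digital = parallel_two_pole(impulse, A1, a1, A2, a2, T)
    sampled = [A1 * math.exp(-a1 * n * T) + A2 * math.exp(-a2 * n * T)
               for n in range(10)]
    print(max(abs(d - s) for d, s in zip(digital, sampled)))  # rounding error only
```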

The 300-900 Hz filter used to eliminate the speech fundamental was the digital
counterpart to a 7-pole Lerner bandpass filter. Pictorially the filter is represented
in Fig. 24. The constants representing pole positions of the digital filter
were computed using the PDP-1. The frequency response and impulse response of
this filter are depicted in Fig. 25 and Fig. 26 respectively.



Fig.  24.  Seven  Pole  Lerner  Bandpass  Filter  in  Digital  Form. 
Fig. 25. Frequency Response of Seven Pole Lerner Filter.


Fig.  26.  Digital  Impulse  Response  of  Seven  Pole 
Lerner  Bandpass  Filter  (300-900  Hz). 
Fig. 27. Digital Representation of Slope Filter.


Fig.  28.  Frequency  Response  of  Slope  Filter  (12  db/Octave). 
Similar digital methods were used to program the slope filter used in the
experiment. Instead of arranging several digital systems in parallel as above, the
systems were cascaded to produce the desired responses (see Fig. 27). Figure 28
and Fig. 29 are respectively the frequency and impulse responses of this filter.
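
The difference between the parallel and cascade arrangements can be sketched as follows. The report does not list the slope filter coefficients, so the section count and the feedback constant used here are assumptions chosen only to show the structure; two cascaded first-order sections are used because each contributes roughly 6 dB/octave of falloff well above its corner frequency. The fullwave rectifier that precedes the slope filter in the restoration scheme is included as a one-line helper.

```python
def one_pole(x, k):
    """First-order recursion y(n) = x(n) + k * y(n - 1), zero initial state."""
    y, prev = [], 0.0
    for xn in x:
        prev = xn + k * prev
        y.append(prev)
    return y


def cascade(x, ks):
    """Feed the signal through one first-order section per coefficient,
    the output of each section being the input to the next."""
    for k in ks:
        x = one_pole(x, k)
    return x


def fullwave_rectify(x):
    """Fullwave rectification: replace every sample by its magnitude."""
    return [abs(v) for v in x]


if __name__ == "__main__":
    k = 0.5  # illustrative feedback constant, not a value from the report
    print(cascade([1.0, 0.0, 0.0, 0.0], [k, k]))  # [1.0, 1.0, 0.75, 0.5]
```

With two identical sections the impulse response is (n + 1) k^n, which is what the printed values show for k = 0.5.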


Fig.  29.  Digital  Impulse  Response  of  Slope  Filter  (12  db/Octave  Falloff). 